Why is this problem important to solve?
Classifying which customers should receive a loan is important so the bank does not lend money to defaulters (customers who do not pay back), since defaults are one of the main causes of revenue loss for banks. At the same time, the bank wants to keep the customers who are likely to repay. Using machine learning, this decision can be made with less human error and less bias.
What is the intended goal?
The goal is to build a model that can predict whether a customer will be a defaulter or not. This model will also support certain recommendations to the bank, which are given in the conclusion and recommendations section.
What are the main factors that make a customer a defaulter?
We are trying to automate the decision of granting a bank loan. Using data science, this decision can be made faster and with less bias than if the same decision were made by a human.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)
JOB: The type of job that loan applicant has such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.)
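As a quick illustration of the DEBTINC definition above (the dollar amounts below are made up for the example, not taken from the dataset):

```python
# Hypothetical example of a debt-to-income calculation
monthly_debt_payments = 1500.0  # e.g. mortgage + car payment + credit card minimums
gross_monthly_income = 4500.0

# DEBTINC is expressed as a percentage in the HMEQ data
debtinc = monthly_debt_payments / gross_monthly_income * 100
print(round(debtinc, 2))  # 33.33
```

A value around 33, like the one above, matches the typical range seen later in `data.describe()`.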
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score
from sklearn import metrics
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Save the name of the file path to make it easier for modifications
file_path = '/content/drive/MyDrive/Colab Notebooks/Capstone Project/hmeq.csv'
# Read the file
Loan_Default = pd.read_csv(file_path)
# Make a copy of the data frame for modifications
data = Loan_Default.copy()
# Looking at the first 5 elements of the data
data.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# Looking at the last 5 elements of the data
data.tail()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
# Looking at the info of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Observations:
There are 5,960 entries, and not all of the columns are complete. Several columns have missing values; DEBTINC has the fewest non-null entries.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| BAD | 5960.0 | 0.199497 | 0.399656 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| LOAN | 5960.0 | 18607.969799 | 11207.480417 | 1100.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 89900.000000 |
| MORTDUE | 5442.0 | 73760.817200 | 44457.609458 | 2063.000000 | 46276.000000 | 65019.000000 | 91488.000000 | 399550.000000 |
| VALUE | 5848.0 | 101776.048741 | 57385.775334 | 8000.000000 | 66075.500000 | 89235.500000 | 119824.250000 | 855909.000000 |
| YOJ | 5445.0 | 8.922268 | 7.573982 | 0.000000 | 3.000000 | 7.000000 | 13.000000 | 41.000000 |
| DEROG | 5252.0 | 0.254570 | 0.846047 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 5380.0 | 0.449442 | 1.127266 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 5652.0 | 179.766275 | 85.810092 | 0.000000 | 115.116702 | 173.466667 | 231.562278 | 1168.233561 |
| NINQ | 5450.0 | 1.186055 | 1.728675 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 17.000000 |
| CLNO | 5738.0 | 21.296096 | 10.138933 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 71.000000 |
| DEBTINC | 4693.0 | 33.779915 | 8.601746 | 0.524499 | 29.140031 | 34.818262 | 39.003141 | 203.312149 |
Observations:
The mean of BAD is about 0.2, meaning roughly 20% of customers defaulted.
The loan amount has an average of $18,608 among the customers.
The mortgage due amount has a mean of $73,760, but some values are missing.
The mean property value is $101,776.
The average tenure at the current job is around 8.9 years.
The mean number of major derogatory reports is 0.25; although below one, this is notable given the seriousness of such reports.
The mean number of delinquent credit lines is 0.45, which is concerning because it indicates that many borrowers are missing the minimum required payments.
The mean number of existing credit lines (CLNO) is about 21.
The debt-to-income ratio has a mean of 33.8, meaning debt payments typically amount to about a third of each customer's gross monthly income.
# Making a list for the variables that are numerical
num_cols = data.select_dtypes('number').columns
num_variable_list = num_cols.tolist()
num_variable_list
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
# Making a list for the variables that are object/strings
cat_cols = data.select_dtypes('object').columns # Selecting categorical columns
cat_variable_list = list(data[cat_cols].columns.values.tolist())
cat_variable_list
['REASON', 'JOB']
# Getting the percentages of each categorical feature
for column in cat_cols:
print("-----",column,"-----")
print(data[column].value_counts(normalize = True))
print("-" * 50)
----- REASON -----
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
--------------------------------------------------
----- JOB -----
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
--------------------------------------------------
Observations:
The main loan reason is 'DebtCon'. The most common job category is 'Other', followed by 'ProfExe'; perhaps the bank should offer more job options so that fewer applicants end up in 'Other'.
# Count the NaN values in each column
null_count = data.isna().sum()
print(null_count.sort_values())
BAD           0
LOAN          0
VALUE       112
CLNO        222
REASON      252
JOB         279
CLAGE       308
NINQ        510
YOJ         515
MORTDUE     518
DELINQ      580
DEROG       708
DEBTINC    1267
dtype: int64
# Null count as a percentage of total rows
new_null_count = null_count / len(data)
print(new_null_count.sort_values())
BAD        0.000000
LOAN       0.000000
VALUE      0.018792
CLNO       0.037248
REASON     0.042282
JOB        0.046812
CLAGE      0.051678
NINQ       0.085570
YOJ        0.086409
MORTDUE    0.086913
DELINQ     0.097315
DEROG      0.118792
DEBTINC    0.212584
dtype: float64
Observations:
Given the high share of missing values in some of the columns, it is more valuable to treat the missing values before doing the EDA.
Missing value treatment
# First we convert the BAD column to categorical.
cols_cat = data.select_dtypes(['object']).columns.tolist()
cols_cat.append('BAD') # We add BAD
# Cast BAD to object so it is treated as a categorical variable (np.object is deprecated)
data['BAD'] = data['BAD'].astype(object)
print(data.info()) # We confirm that BAD is an object dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   object
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(1), object(3)
memory usage: 605.4+ KB
None
# Treat Missing values in numerical columns with median and mode in categorical variables
# Select numeric columns
num_data = data.select_dtypes('number')
# Select string and object columns
cat_data = data.select_dtypes('object').columns.tolist()
# Fill numeric features with median
data[num_data.columns] = num_data.fillna(num_data.median())
for column in cat_data:
mode = data[column].mode()[0]
data[column] = data[column].fillna(mode)
#print(column,mode) #to check the iterations
data.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | 34.818262 |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | 34.818262 |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | 34.818262 |
| 3 | 1 | 1500 | 65019.0 | 89235.5 | DebtCon | Other | 7.0 | 0.0 | 0.0 | 173.466667 | 1.0 | 20.0 | 34.818262 |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | 34.818262 |
Leading Questions:
# Plot distribution and boxplots for numerical variables
for col in num_data:
print(col)
print('Skew :', round(data[col].skew(), 2))
plt.figure(figsize = (15, 4))
plt.subplot(1,2,1)
data[col].hist(bins = 10, grid = False)
plt.ylabel('count')
plt.subplot(1, 2, 2)
sns.boxplot(x = data[col])
plt.show()
LOAN Skew : 2.02
MORTDUE Skew : 1.94
VALUE Skew : 3.09
YOJ Skew : 1.09
DEROG Skew : 5.69
DELINQ Skew : 4.25
CLAGE Skew : 1.39
NINQ Skew : 2.77
CLNO Skew : 0.8
DEBTINC Skew : 3.11
Observations:
All numerical variables are right-skewed; DEROG (5.69) and DELINQ (4.25) are the most heavily skewed, and the boxplots show a large number of outliers in most features.
# Bar plot of the DELINQ and its count
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'DELINQ', data = data)
# Get the number count for each column
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
plt.show()
# Bar plot of each categorical variable, with the status distinction.
for col in cat_data:
print(col)
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = col, data = data)
# Get the number count for each column
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
plt.show()
BAD
REASON
JOB
# Getting boxplots for each numerical feature by status
for col in num_data:
print(col)
plt.figure(figsize = (10, 5))
sns.boxplot(y=data[col], x=data["BAD"])
plt.show()
LOAN
MORTDUE
VALUE
YOJ
DEROG
DELINQ
CLAGE
NINQ
CLNO
DEBTINC
# Bar plot of each categorical variable, with the status distinction.
for col in cat_data:
print(col)
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = col, hue = 'BAD', data = data)
# Get the number count for each column
for p in ax.patches:
ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x(), p.get_height()))
plt.show()
BAD
REASON
JOB
def stacked_plot(x):
    sns.set(palette = 'nipy_spectral')
    tab1 = pd.crosstab(x, data['BAD'], margins = True)
    print(tab1)
    print('-' * 120)
    tab = pd.crosstab(x, data['BAD'], normalize = 'index')
    tab.plot(kind = 'bar', stacked = True, figsize = (10, 5))
    plt.legend(loc = "upper left", bbox_to_anchor = (1, 1))
    plt.show()
# Plot stacked bar plot for BAD and REASON
stacked_plot(data['REASON'])
# Correlations between all the variables using a heatmap
plt.figure(figsize = (12, 7))
sns.heatmap(data.corr(), annot = True)
plt.show()
# create a pair plot by status to find any relationships
#sns.pairplot(data, hue ='BAD')
# Make a list of the features that will have the outlier treatment
outlier_list = [
'LOAN',
'MORTDUE',
'VALUE',
'YOJ',
'CLAGE',
'NINQ',
'CLNO',
'DEBTINC']
# Using the IQR to treat outliers
def treat_outliers(df,col):
Q1=df[col].quantile(0.25) # 25th quantile
Q3=df[col].quantile(0.75) # 75th quantile
IQR=Q3-Q1 # IQR Range
    Lower_Whisker = Q1 - IQR # note: 1*IQR whiskers, tighter than the conventional 1.5*IQR
    Upper_Whisker = Q3 + IQR
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker) # used to limit the values with the upper and lower whiskers
return df
# Using the past function to do it in all the numerical variables
def treat_outliers_all(df, col_list):
for c in col_list:
df = treat_outliers(df,c)
return df
data_no_outlier = data.copy()
numerical_col = outlier_list #list of numerical columns except BAD
df = treat_outliers_all(data_no_outlier,numerical_col)
# Borrowed function from making case studies that will create boxplot and histogram for any input numerical variable.
# This function takes the numerical column as the input and return the boxplots and histograms for the variable.
def histogram_boxplot(feature, figsize=(15,10), bins = None):
f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2, # Number of rows of the subplot grid= 2
sharex = True, # x-axis will be shared among all subplots
gridspec_kw = {"height_ratios": (.25, .75)},
figsize = figsize
) # creating the 2 subplots
sns.boxplot(feature, ax=ax_box2, showmeans=True, color='pink', orient="h") # boxplot will be created and a star will indicate the mean value of the column
    sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins) if bins else sns.distplot(feature, kde=False, ax=ax_hist2) # For histogram
ax_hist2.axvline(np.mean(feature), color='red', linestyle='--') # Add mean to the histogram
ax_hist2.axvline(np.median(feature), color='black', linestyle='-') # Add median to the histogram
# Plot the histogram and boxplot with outliers
for col in num_data:
histogram_boxplot(data[col])
# Plot the histogram and boxplot without outliers
for col in num_data:
histogram_boxplot(df[col])
Observations:
The treatment of outliers worked as seen on the previous plots and the data is ready to do the modeling.
# Correlations between all the variables using a heatmap
plt.figure(figsize = (12, 7))
sns.heatmap(data_no_outlier.corr(), annot = True)
plt.show()
Observations:
The heatmap shows one strong positive correlation, between VALUE and MORTDUE. There are other, weaker positive correlations as well: LOAN and VALUE, LOAN and MORTDUE, CLNO with MORTDUE and VALUE, DEROG and BAD, and DELINQ and BAD, among others.
# create a pair plot by status to find any relationships
sns.pairplot(data, hue ='BAD')
<seaborn.axisgrid.PairGrid at 0x7f294a1a4c70>
What are the most important observations and insights from the data based on the EDA performed?
Leading Questions:
The loans range from $1,100 to $89,900.
The distribution of years on the job shows that most customers have fewer than 7 years at their current job.
Two: DebtCon and HomeImp (home improvement).
The most common job category is 'Other'.
More defaults come from DebtCon, but proportionally there is not much difference between the two reasons.
Applicants with larger loans tend to default less, and applicants with smaller loans tend to default more.
Yes, there is a weak positive correlation between the loan amount and the value of the property.
No, there is no significant difference.
# Data Preparation
#clean data set for logistic regression
df_LR = df.copy()
# A new copy for the modeling
df_ready = df.copy()
# Drop the dependent variable from the dataframe and create the X(independent variable) matrix
X = df_ready.drop(columns = 'BAD')
# Create dummy variables for the categorical variables using get_dummies()
X = pd.get_dummies(X, drop_first = True)
# Create y (dependent variable)
Y = df_ready['BAD']
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
#Split the data into training and testing
# 70/30 split
# All rows where 'BAD' column is 1
input_ones = df_LR[df_LR['BAD'] == 1]
# All rows where 'BAD' column is 0
input_zeros = df_LR[df_LR['BAD'] == 0]
# For repeatability of sample
np.random.seed(100)
input_ones_training_rows = np.random.choice(input_ones.index, int(0.7 * len(input_ones)), replace=False)
input_zeros_training_rows = np.random.choice(input_zeros.index, int(0.7 * len(input_zeros)), replace=False)
# Pick as many 0 and 1
training_ones = input_ones.loc[input_ones_training_rows]
training_zeros = input_zeros.loc[input_zeros_training_rows]
# Concatenate
trainingData = pd.concat([training_ones, training_zeros])
# Create test data
test_ones = input_ones.drop(input_ones_training_rows)
test_zeros = input_zeros.drop(input_zeros_training_rows)
# Concatenate
testData = pd.concat([test_ones, test_zeros])
#check for imbalance
bad_counts = trainingData['BAD'].value_counts()
print(bad_counts)
print('____________________')
# check class distrubution
class_distribution = trainingData['BAD'].value_counts(normalize=True)
print(class_distribution)
0    3339
1     832
Name: BAD, dtype: int64
____________________
0    0.800527
1    0.199473
Name: BAD, dtype: float64
# Visualize class distribution using a bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x=class_distribution.index, y=class_distribution.values)
plt.xlabel('0: Repaid, 1: Defaulted')
plt.ylabel('Ratio')
plt.title('Class Distribution')
plt.show()
Observations:
The bar plot shows a clear class imbalance: about 80% of customers repaid their loans, while about 20% defaulted.
# Function to print classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (8, 5))
    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Repaid', 'Defaulted'], yticklabels = ['Repaid', 'Defaulted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
#Create a table to add the results of the model
results = pd.DataFrame(columns = ['Model_Name','Train_f1','Train_recall','Train_precision','Test_f1','Test_recall','Test_precision'])
results.head()
| | Model_Name | Train_f1 | Train_recall | Train_precision | Test_f1 | Test_recall | Test_precision |
|---|---|---|---|---|---|---|---|
#Defining Decision tree model using the previous class weights for imbalances
d_tree_base = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8} ,random_state = 7)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=7)
# Checking performance on the training data
y_pred_train_d_tree = d_tree.predict(X_train)
metrics_score(y_train, y_pred_train_d_tree)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking performance on the testing data
y_pred_test_d_tree = d_tree.predict(X_test)
metrics_score(y_test, y_pred_test_d_tree)
precision recall f1-score support
0 0.90 0.93 0.91 1416
1 0.69 0.60 0.64 372
accuracy 0.86 1788
macro avg 0.79 0.76 0.78 1788
weighted avg 0.85 0.86 0.86 1788
# Adding the results to the table
new_row = {'Model_Name': 'd_tree_base',
'Train_f1': 100,
'Train_recall': 100,
'Train_precision':100,
'Test_f1': 64,
'Test_recall': 60,
'Test_precision': 69}
results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)
# Print the updated DataFrame
print(results)
    Model_Name Train_f1 Train_recall Train_precision Test_f1 Test_recall  \
0  d_tree_base      100          100             100      64          60

  Test_precision
0             69
Criterion {“gini”, “entropy”}
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
You can learn about more hyperparameters at this link and try to tune them.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
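To see how the two split-quality criteria differ, the impurity measures can be computed by hand (a minimal sketch with illustrative class proportions, not scikit-learn's internal code):

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (base 2), ignoring zero-probability classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A node with 80% repaid (0) and 20% defaulted (1), matching the class balance in this data
probs = [0.8, 0.2]
print(round(gini(probs), 3))     # 0.32
print(round(entropy(probs), 3))  # 0.722
```

Both measures are 0 for a pure node and largest for a 50/50 split; the tree picks the split that most reduces the chosen measure.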
# Choose the type of classifier.
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.2, 1: 0.8})
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 6), #[2, 3, 4, 5 ]
'criterion': ['gini', 'entropy'], #use both gini and entropy to measure split quality
'min_samples_leaf': [5, 10, 20, 25] #minimum number of samples to be a leaf node
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=5, min_samples_leaf=10, random_state=7)
# Checking performance on the training data based on the tuned model
y_pred_train_d_tree_tuned = d_tree_tuned.predict(X_train)
metrics_score(y_train,y_pred_train_d_tree_tuned)
precision recall f1-score support
0 0.94 0.90 0.92 3355
1 0.66 0.77 0.71 817
accuracy 0.88 4172
macro avg 0.80 0.84 0.82 4172
weighted avg 0.89 0.88 0.88 4172
# Checking performance on the testing data based on the tuned model
y_pred_test_d_tree_tuned = d_tree_tuned.predict(X_test)
metrics_score(y_test,y_pred_test_d_tree_tuned)
precision recall f1-score support
0 0.92 0.91 0.92 1416
1 0.68 0.72 0.70 372
accuracy 0.87 1788
macro avg 0.80 0.81 0.81 1788
weighted avg 0.87 0.87 0.87 1788
# Adding the results to the table
new_row = {'Model_Name': 'd_tree_base_tuned',
'Train_f1': 71,
'Train_recall': 77,
'Train_precision':66,
'Test_f1': 70,
'Test_recall': 72,
'Test_precision': 68}
results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)
# Print the updated DataFrame
print(results)
          Model_Name Train_f1 Train_recall Train_precision Test_f1  \
0        d_tree_base      100          100             100      64
1  d_tree_base_tuned       71           77              66      70

  Test_recall Test_precision
0          60             69
1          72             68
# Plot the decision tree and analyze it to build the decision rule
features = list(X.columns)
plt.figure(figsize = (40,40))
tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 10, node_ids = True, class_names = True)
plt.show()
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (15, 10))
plt.title('Most Important Features')
plt.barh(range(len(indices)), importances[indices], color = 'blue')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Importance Ratio')
plt.show()
Observations:
The main features driving default are: DEBTINC, DELINQ, CLAGE, and MORTDUE.
Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, a decision tree is fit on each sample, and each tree makes its own prediction.
The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
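The bootstrap-and-vote idea described above can be sketched by hand (a minimal illustration on synthetic data, not the HMEQ dataset; the `make_classification` problem and the number of trees are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic binary-classification problem for the illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Majority vote across the individual trees' predictions
votes = np.array([t.predict(X) for t in trees])   # shape (25, 200)
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # 1 if most trees predict 1
print(majority.shape)  # (200,)
```

`RandomForestClassifier` does this internally (plus random feature subsetting at each split), which is why it generalizes better than a single deep tree.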
# Defining Random forest CLassifier
rf_estimator = RandomForestClassifier(random_state=7,criterion="entropy")
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
#Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train,y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 3355
1 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# Checking performance on the test data
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.92 0.98 0.95 1416
1 0.88 0.66 0.76 372
accuracy 0.91 1788
macro avg 0.90 0.82 0.85 1788
weighted avg 0.91 0.91 0.91 1788
# Adding the results to the table
new_row = {'Model_Name': 'Random Forest Classifier',
'Train_f1': 100,
'Train_recall': 100,
'Train_precision':100,
'Test_f1': 76,
'Test_recall': 66,
'Test_precision': 88}
results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)
# Print the updated DataFrame
print(results)
                 Model_Name Train_f1 Train_recall Train_precision Test_f1  \
0               d_tree_base      100          100             100      64
1         d_tree_base_tuned       71           77              66      70
2  Random Forest Classifier      100          100             100      76

  Test_recall Test_precision
0          60             69
1          72             68
2          66             88
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110],
"max_depth": [5,6],
"max_leaf_nodes": [8,10],
"min_samples_split":[20],
'criterion': ['gini'],
"max_features": ['sqrt'],
"class_weight": ["balanced",{0: 0.2, 1: 0.8}]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=8,
                       min_samples_split=20, n_estimators=110, random_state=7)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 0.95 0.83 0.88 3355
1 0.53 0.82 0.65 817
accuracy 0.82 4172
macro avg 0.74 0.82 0.77 4172
weighted avg 0.87 0.82 0.84 4172
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.94 0.83 0.88 1416
1 0.55 0.79 0.65 372
accuracy 0.82 1788
macro avg 0.74 0.81 0.76 1788
weighted avg 0.86 0.82 0.83 1788
# Plot the most important features
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize = (5, 5))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'blue')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Adding the results to the table
new_row = {'Model_Name': 'Random Forest Classifier Tuned',
'Train_f1': 65,
'Train_recall': 82,
'Train_precision':53,
'Test_f1': 65,
'Test_recall': 79,
'Test_precision': 55}
results = pd.concat([results, pd.DataFrame([new_row])], ignore_index=True)
# Print the updated DataFrame
print(results)
results
                       Model_Name Train_f1 Train_recall Train_precision  \
0                     d_tree_base      100          100             100
1               d_tree_base_tuned       71           77              66
2        Random Forest Classifier      100          100             100
3  Random Forest Classifier Tuned       65           82              53

  Test_f1 Test_recall Test_precision
0      64          60             69
1      70          72             68
2      76          66             88
3      65          79             55
| Model_Name | Train_f1 | Train_recall | Train_precision | Test_f1 | Test_recall | Test_precision | |
|---|---|---|---|---|---|---|---|
| 0 | d_tree_base | 100 | 100 | 100 | 64 | 60 | 69 |
| 1 | d_tree_base_tuned | 71 | 77 | 66 | 70 | 72 | 68 |
| 2 | Random Forest Classifier | 100 | 100 | 100 | 76 | 66 | 88 |
| 3 | Random Forest Classifier Tuned | 65 | 82 | 53 | 65 | 79 | 55 |
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
The best technique was the tuned Random Forest, achieving 79% recall on the test set, although its precision dropped to 55%.
2. Refined insights:
The most important insight is that DEBTINC, DELINQ, DEROG, and CLAGE are the main features. These features were relevant in both the decision tree and the random forest.
3. Proposal for the final solution design:
The best model to adopt is the tuned Random Forest, as it is the model that maximizes recall. Even though its F1 score is lower, for this dataset it is more beneficial to maximize recall, i.e., to catch as many defaulters as possible.
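To make the recall-versus-precision trade-off concrete, here is a small hand-made example (the labels below are invented for illustration, not taken from the test set):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 = defaulted, 0 = repaid
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # catches 3 of 4 defaulters, with 2 false alarms

print(recall_score(y_true, y_pred))     # 0.75 -> few defaulters slip through
print(precision_score(y_true, y_pred))  # 0.6  -> some good customers are flagged
```

For a lender, a missed defaulter (a false negative, which lowers recall) usually costs far more than a wrongly flagged good customer (a false positive, which lowers precision), which is why recall was chosen as the tuning metric.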